8 research outputs found

    A Word Counting Graph

    Get PDF
    We study methods for counting occurrences of words from a given set H over an alphabet V in a given text. All words have the same length m. Our goal is the computation of the probability to find p occurrences of words from a set H in a random text of size n, assuming that the text is generated by a Bernoulli or Markov model. We have designed an algorithm solving the problem; the algorithm relies on traversals of a graph, whose set of vertices is associated with the overlaps of words from H. Edges define two oriented subgraphs that can be interpreted as equivalence relations on words of H. Let P (H) be the set of equivalence classes and S be the set of other vertices. The run time for the Bernoulli model is O(np(|P (H)| +|S|)) time and the space complexity is O(pm|S| +|P (H)|). In a Markov model of order K, additional space complexity is O(pm|V | K ) and additional time complexity is O(npm|V | K). Our preprocessing uses a variant of Aho-Corasick automaton and achieves O(m|H|) time complexity. Our algorithm is implemented and provides a significant space improvement in practice. We compare its complexity to the additional improvement due to AhoCorasick minimization

    Efficient seeding techniques for protein similarity search

    Get PDF
    We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets.We then perform an analysis of seeds built over those alphabet and compare them with the standard Blastp seeding method [2,3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seed is less expressive (but less costly to implement) than the accumulative principle used in Blastp and vector seeds, our seeds show a similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix

    Efficient seeding techniques for protein similarity search

    Get PDF
    We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets.We then perform an analysis of seeds built over those alphabet and compare them with the standard Blastp seeding method [2,3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seed is less expressive (but less costly to implement) than the accumulative principle used in Blastp and vector seeds, our seeds show a similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix

    On subset seeds for protein alignment

    Get PDF
    We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform a comparative analysis of seeds built over those alphabets and compare them with the standard BLASTP seeding method [2], [3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the cumulative principle used in BLASTP and vector seeds, our seeds show a similar or even better performance than BLASTP on Bernoulli models of proteins compatible with the common BLOSUM62 matrix. Finally, we perform a large-scale benchmarking of our seeds against several main databases of protein alignments. Here again, the results show a comparable or better performance of our seeds vs. BLASTP.Comment: IEEE/ACM Transactions on Computational Biology and Bioinformatics (2009

    Minimized Compact Automaton for Clumps over Degenerate Patterns

    No full text
    Clumps are sequences of overlapping occurrences of a given pattern that play a vital role in the study of distribution of pattern occurrences. These distributions are sused for finding functional fragments in biological sequences. In this paper we present a minimized compacted automaton (Overlap walking automaton,OWA) recognizing all the possible clumps for degenerate patterns and its usage for computation of probabilities of sets of clumps. We also present Aho-Corasicklike automaton, RMinPatAut, recognizing all the sequences ending with pattern occurrences. The statesof RMinPatAut are equivalence classes on the prefixes of the pattern words. We have proved thatRMinPatAut is Nerode-minimal, i.e., minimal in classical sense. We use RMinPatAut as an auxiliary structure for OWA construction. For degenerate pattern,RMinPatAut can be constructed in linear time on the number of its states (it is bounded by 2m, where m the length of pattern words).OWA can be constructed in linear time on the sum of its size and RMinPatAut size
    corecore